1  Objective

The objective of this research program was to develop mathematical foundations of information gathering
through an integrated theory of sensing, inference, and control. The goal of the team was to develop a new
framework for autonomous operations that extends the state of the art in distributed learning and modeling
from data, and tightly integrates these models into new decentralized cooperative planning algorithms. The
main intended output of this effort was a fundamental theory that integrates decentralized information-driven
planning methods for heterogeneous teams with nonparametric Bayesian models of uncertainty. The feasibility
and aspects of the value of the theory were demonstrated via integrated software and hardware experiments.

Phase I included an extensive set of mathematical and algorithmic developments which formed the basis
of an integrated system. Bayesian inference represented by graphical models mediated between sensors and
event probabilities of interest. Temporal Logic mediated between the use of graphical models for inference
and the interpretation of system queries. In the proposed architecture, a constructive Temporal Logic approach
reduces first-order logic queries to a system of graphical models.

During Phase 2, algorithmic development emphasized transitioning from sensor-centric to scene-centric
processing. As such, sensing geometry and the associated nuisance parameters, noisy and missing data, and
multi-view and multi-modal sensing were important considerations for modeling and development. Methods to
exploit information measures and their relation to the instantiated graphical structures were developed to
investigate the trade-off between computational resource costs and the quality of approximate inference
methods. Hierarchical Bayesian nonparametric methods were investigated for the purpose of modeling both
contextual representations and specific instances of the objects, attributes, and relations envisioned
under the program.

While a significant aspect of MSEE Phases II and III was devoted to system development, fundamental
research in distributed planning and control, sensor and information management, and intent recognition
was still pursued to achieve the ambitious goals of the program.

2  Overview 

We provide an overview of the system developed by the MIT team as well as a description of the research
results, which are further detailed in the technical publications listed at the end of this report.

2.1  Team  Members 

Table  1  lists  the  various  key  members  of  the  team  (by  institution)  and  their  primary  areas  of  expertise  and 
responsibilities. 


Org          Capabilities & Responsibilities                     Key Personnel
MIT          BNP Models, Inference, & Planning                   Dr. John Fisher, Prof. Jon How
ICSI         BNP Models, Large scale object recognition          Prof. Trevor Darrell
UCLA         3D/Geometric scene representation                   Prof. Stefano Soatto
ETH Zurich   Discrete and mixed integer-continuous optimization  Prof. Andreas Krause
BAE Systems  Temporal logic & system integration                 Dr. Luis Galup, Ms. Wendy Mungovan,
                                                                 Mr. Manuel Cuevas

Table  1:  Team  members  and  primary  technical  expertise.  Note  that  Prof.  Krause  joined  the  team  at  the 
beginning  of  Phase  2,  while  Dr.  Galup  and  Ms.  Mungovan  left  the  team  at  the  completion  of  Phase  2. 
Figure 1: Functional system block diagram of MIT MSEE SUT implementation.


2.2  System  Description 

Figure 1 depicts the functional system block diagram of the MIT MSEE SUT implementation. Communication
with the EES, query ingestion, query parsing, and predicate tasking are performed within the SUT Framework
developed by BAE. Scene modeling, including labeling of moving and static objects as well as determining
3D geometry, is performed off-line and stored in a database. Geometric modeling is performed by modules
developed by UCLA, while object tracking and scene labeling are performed by modules developed by
MIT. Finally, object labeling (including tracked objects) is also performed off-line using a variant of Caffe
developed by ICSI. All results are stored in PostgreSQL databases for later indexing during query processing.

The  goal  was  to  develop  a  working  system  for  query-based  scene  understanding  that  integrates  physical 
sensor  models  of  video  cameras,  Bayesian  reasoning  via  structured  graphical  models  and  integration  of 
contextual  models.  Following  the  Phase  2  demonstration,  the  team  had  produced  a  functioning  end-to-end 
system  demonstrating  the  following  functionality: 

•  large  scale  object  classification, 

•  semi-automated  3D  scene  modeling, 

•  extensible  system  for  predicate  implementation, 

•  ability  to  reason  over  geometric,  dynamic,  and  behavioral  relations 

The Phase 2 system emphasized sensor-centric processing for predicate reasoning with extensions to
3D reasoning aided by 3D scene representation. An initial working version of the system was transitioned
to the Air Force Research Laboratory. Recent extensions are in the process of being transitioned as well.

2.3  Processing  Flow 

Figure 2 depicts the conceptual approach of the MIT MSEE design. Here, an intermediate representation
comprising (1) a scene representation, (2) object and mover attribution, and (3) tracking of movers
separates sensing from reasoning. The advantage is that reasoning can be defined in terms of physical
relations (as parameterized by the representation) and logical functions. Queries (as prescribed by the formal
language specification) are composed of predicates which are defined deterministically over the intermediate
representation. As such, uncertainty is modeled in the intermediate representation (e.g., due to sensor noise
and model mismatch) rather than in the reasoning system.

Figure 2: Conceptual Diagram of MIT MSEE Design

As a result, predicates are mapped to collections of inference algorithms implemented as modular and
composable probabilistic graphical models. Conceptually, one could instantiate a monolithic model and
focus inference on the relevant latent variables. However, for the complexity of the scenes contemplated
by MSEE and the number of sensors, such an approach is intractable. An additional (and substantial) benefit
of the modular approach is that it allows efficient and principled handling of nuisance parameters only when
necessary, optimization of the measurement process, and instantiation of only those aspects of the
representation that are relevant to the query. The modular approach also lends itself easily to parallelization.

We note that graphical models are not a panacea; rather, they are a framework. While they aid in organizing
relationships between queries, sensors, and the scene while making dependency assumptions explicit, they
only suggest methods for inference. The critical choice of how to perform inference in a given graphical
model is left to the designer and will depend on the definitions of the predicates which reason over that
graphical model. That being said, the modular approach allows these models to be designed independently.

3  System  Performance 

3.1  Predicate  Handling  Framework 

Predicate analysis and evaluation are implemented as a separate module (denoted by the red box in Figure
3). In the MIT design and implementation, predicate results are treated as independent. This choice was
made for practical reasons: modeling (and reasoning) over dependent predicates is not feasible given the
number of relations the system would have to consider. Treating them independently is akin to making what
is known as the naive Bayes assumption. One practical consequence is that predicates can be evaluated in
parallel, allowing for significant speedups in analysis. Predicates are roughly grouped into three
categories: behavior predicates, relationship predicates, and action predicates. These groupings
are shown in Table 2.
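
The independence assumption above can be illustrated with a minimal Python sketch. It is not the project's implementation: the predicate functions `moving` and `together`, their fixed return values, and the reading of a query as a conjunction of its predicates are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from math import prod


def evaluate_predicates(predicates, args):
    """Evaluate predicates concurrently; each one returns P(predicate is true)."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(fn, *args) for fn in predicates]
        # Results are collected in submission order.
        return [f.result() for f in futures]


def conjunction_probability(probabilities):
    """P(all predicates true) when predicate results are treated as independent."""
    return prod(probabilities)


# Hypothetical stand-in predicates returning fixed probabilities.
def moving(track):
    return 0.9


def together(track):
    return 0.5


probs = evaluate_predicates([moving, together], ("track-1",))
query_probability = conjunction_probability(probs)
```

Because no predicate reads another predicate's result, the calls can be dispatched in any order or in parallel, which is the practical consequence noted above.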

As currently implemented, incorporation of new predicates is a straightforward process of defining the
predicate as a logical function of its inputs and their relation to the physical properties of the scene. For
example, the predicate “together” is defined in terms of the proximity of the arguments, specified in physical
units (when available) or in terms of sensor dimensions (e.g., pixels) when the physical units are not available.
Figure 3: Predicate evaluation is implemented as a separate module, denoted by red box in figure.

Table 2: Predicate categorization and implementation status.

Behavior           Relationships                              Actions
                   Implemented          Not Implemented       Implemented        Not Implemented
1. Starting        1. Same-object       1. Touching           1. Driving         1. Loading
2. Moving          2. Part-of           2. Facing             2. Entering        2. Unloading
3. Stopping        3. CLOS              3. Facing-opposite    3. Exiting         3. Donning
4. Stationary      4. Occluding         4. Inside             4. Crossing        4. Doffing
5. Turning         5. On                5. Outside            5. Carrying        5. Wearing
6. Turning-right   6. Together          6. Putting-in         6. Mounting        6. Swinging
7. Turning-left    7. Closer                                  7. Dismounting
8. U-turn          8. Farther                                 8. Putting-up
9. Crawling        9. Below                                   9. Taking-down
10. Walking        10. Same-motion                            10. Throwing
11. Running        11. Opposite-motion                        11. Catching
12. Sitting        12. Following                              12. Putting-down
13. Standing       13. Passing                                13. Picking-up
14. Talking                                                   14. Dropping
15. Writing
16. Reading
17. Eating
18. Pointing
19. Open
20. Closed

The former is always possible so long as the scene properties have been specified (described elsewhere),
in which case the predicate makes use of so-called “helper functions” that define the relation of the predicate
arguments to the scene being analyzed. Whether to utilize the physical dimensions of the scene (which
are subject to sensor uncertainty) and the associated helper functions is left to the predicate designer.
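
As a rough illustration of the “together” example, the sketch below defines a proximity predicate that uses physical units when a helper function is supplied and falls back to pixel distance otherwise. The function name, the `to_world` helper, and both thresholds are hypothetical; the report does not state the values or signatures used in the actual system.

```python
import math


def together(pos_a, pos_b, to_world=None, max_dist_m=2.0, max_dist_px=50.0):
    """Return True if two objects are within a proximity threshold.

    pos_a, pos_b: (x, y) image coordinates of the two objects.
    to_world: optional helper mapping image coordinates to scene
        coordinates in metres; None when no scene geometry is available.
    max_dist_m, max_dist_px: illustrative thresholds (assumed values).
    """
    if to_world is not None:
        # Physical units are available: compare distances in the scene.
        return math.dist(to_world(pos_a), to_world(pos_b)) <= max_dist_m
    # No scene geometry: fall back to distance in sensor dimensions (pixels).
    return math.dist(pos_a, pos_b) <= max_dist_px


# Hypothetical helper: a uniform scale of 1 cm per pixel.
def cm_per_pixel(p):
    return (p[0] * 0.01, p[1] * 0.01)


with_geometry = together((100, 100), (250, 100), to_world=cm_per_pixel)
without_geometry = together((100, 100), (250, 100))
```

The same pair of objects can be judged differently by the two branches, which is one reason the choice between physical and sensor units is left to the predicate designer.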

Details of the predicate handling framework are shown in the system block diagram of Figure 4. The
predicate handling framework (1) interfaces with the MSEE framework (i.e., the system which receives the
query from the EES and parses it), (2) accesses the database of precomputed analysis (tracks of movers,
labels of objects, and the geometric description of the scene), (3) determines the order and combination
of predicates to evaluate, and (4) handles various special cases and checks for errors.

Figure 4: Predicate handling framework. Predicates access the results of sensor data processing via
a database of pre-computed analysis including 3D scene analysis, tracking of moving objects, and
classification of moving and static objects.

Table 3: Syntax for MSEE framework call to predicate handling framework

Syntax:   ... (<predicate ...>, <time window start>, <time window end>, <obj1>, [obj2], [obj3]);

Inputs:   List of objects for inputs (videoID, trackID) of the predicate.
          Window of interest (string).

Output:   Matrix containing probability of the predicate being true for each of the object
          combinations.

The syntax for the MSEE framework call to the predicate handling framework (circle 1 in Figure 4) is
shown in Table 3. Having received the predicate call from the MSEE framework, the predicate handling
framework issues separate predicate calls for each valid combination of unary, binary, or ternary arguments
along with the associated track and scene information. The syntax for calling a specific instance of a
predicate (circle 2 in Figure 4) is shown in Table 4. While the MSEE framework can parallelize calls to the
predicate handling framework (once the query is parsed) across predicates, further parallelization is possible
within the predicate handling framework across instances of argument combinations.
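
The expansion of one framework call into per-combination predicate calls can be sketched as follows. This is a simplified illustration, not the project's code: the function name, the dictionary return type (standing in for the probability matrix of Table 3), and the rule that a combination is invalid when it repeats an object are all assumptions.

```python
from itertools import product


def evaluate_combinations(predicate, candidates_per_slot):
    """Evaluate a predicate on every valid combination of candidate objects.

    candidates_per_slot: one list of candidate object IDs per argument slot
        (one list for unary, two for binary, three for ternary predicates).
    Returns a dict mapping each combination to the predicate's probability.
    """
    results = {}
    for combo in product(*candidates_per_slot):
        if len(set(combo)) < len(combo):
            # Assumed validity rule: one object cannot fill two slots.
            continue
        results[combo] = predicate(*combo)
    return results


# Hypothetical binary predicate: objects with adjacent IDs are "near".
def near(a, b):
    return 1.0 if abs(a - b) == 1 else 0.0


table = evaluate_combinations(near, [[1, 2], [2, 3]])
```

Each entry of `table` comes from an independent call, so the loop body is the unit that could be distributed across workers.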

3.2  Predicate  Processing  Time: 

Figures 5 and 6 provide details of the processing time broken down by predicate. Recall that tracking, scene
construction, and object labeling are performed ahead of any query time. Consequently, the values in these
figures reflect the complete predicate reasoning and database access times and do not include
sensor processing time. In future implementations, it would be straightforward to store sensor processing
time as part of the pre-processing step. This would allow analysis that computes both sensor processing
time and logic processing time. Both depend on the complexity of the query, the complexity of the scene,
the number of sensors, and the time duration over which the query is applied.

Figure 5 reflects the total time to process each predicate for a given query. For a given query, this would
be the time to process all valid arguments for a specific predicate. As seen in the figure, most predicates
take very little time to process. Multiple values for a given predicate reflect that the predicate was used
in more than one query. The differences in processing time are a consequence of the number of arguments
passed to the predicate for that particular query. These values are more reflective of the complexity of the
various queries used for Phase 2 testing. Figure 6 reflects the time to process each predicate for a single
instance. Here the differences in processing time are reflective of the temporal duration associated with
the particular instance of the predicate evaluation.

Table 4: Syntax for predicate handling framework to individual predicate instances.

Syntax:   predicate_ptr(info, objs, tracks, scene_3d, params);

Inputs:   General Info (cell array).
          Track Instances (cell array).
          Tracks (cell array).
          3D Representation (function pointers).
          Predicate parameters (structure).

Output:   Structure containing indicator whether predicate is true or false, and associated
          probability.

Figure 5: Processing Time as a Function of Predicates

3.3  Query  Accuracy: 

Phase 2 involved 276 queries submitted to the system. Of those, 218 queries were processed. Some predicates
were not supported and, as a result, any query which contained those predicates was not processed (a total
of 58). The 218 processed queries resulted in 390 predicate calls. This is indicative of the fact that many
queries comprised a single predicate and very few queries incorporated 4 or more predicates (see
Figure 8 (left)).

Figure 6: Processing Time Normalized by Predicate Evaluations

Figure 7: Query Performance.

The system performance for the 218 queries is detailed in Figure 7. The table at the left of the figure provides
counts of true positives, false positives, true negatives, and false negatives. The chart at the right depicts
the relative percentages. We note that the system as implemented has a bias towards returning a “true” value.
This is due to interpreting a query (or predicate) as being true for a given time period even if it is true only
once (i.e., at a single point in time). The consequence is that even if a predicate reports a low probability
of being true at every time instance, the overall probability approaches unity as the length of the time
period grows. This is perhaps the simplest interpretation of what constitutes a query
or predicate being true. Other approaches could be adopted, but were not investigated.
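
The growth of the overall probability with the length of the time window can be made concrete with a short Python sketch. It assumes, for illustration only, that the per-instant results are independent, so that the probability of the predicate being true at least once is one minus the probability that it is false at every instant. The function name and the per-instant value of 0.01 are hypothetical.

```python
from math import prod


def prob_true_at_least_once(per_instant_probs):
    """P(predicate true at one or more instants), assuming independent instants."""
    return 1.0 - prod(1.0 - p for p in per_instant_probs)


# A predicate that is only 1% likely to hold at any single instant.
for n in (1, 10, 100, 1000):
    print(n, prob_true_at_least_once([0.01] * n))
```

Under this reading a weak per-instant signal accumulates over a long window, which matches the bias toward “true” noted above.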

Not all predicates are equal: While Figure 7 reflects average performance for the system when
evaluated over the choice of queries for Phase 2, it is unlikely that it accurately reflects the overall system
performance, as the queries chosen for testing were biased towards the use of a small number of predicates.
This can be seen in Figure 8, which visualizes the relative frequency with which predicates were used within
the Phase 2 testing queries. As can be seen, “part-of” and “same-object” were called significantly more often,
47 and 45 times respectively, as compared to “together”, which was called once. Consequently, the query
performance numbers are largely reflective of the performance on the most frequently called predicates.
Whether this is an accurate reflection depends on the anticipated scenarios in which such a system would
be used.

Figure 8: (left) Breakdown of number of predicates per query, (middle) wordle where the size of the
predicate name reflects the usage frequency across queries, and (right) pie-chart with counts of predicate
usage across queries.

Figure 9: Accuracy of each predicate across all queries.

Figure 9 shows the relative accuracy of each predicate, where blue reflects the number of times the
predicate was called and green the number of times the predicate returned a correct answer. We note that
performance on many queries is substantially above guessing. However, on three of the most frequently
called predicates, “on”, “part-of”, and “same-object”, the performance is fairly poor, resulting in a larger
impact on system performance.
4  Preprocessing

4.1  3D  Scene  Modeling 

The methodology for constructing 3D information of the scene is described in the material provided in this
section. It was noted during the course of the program that the quality of the reconstruction, upon which
accurate spatial reasoning depends, is impacted by both the accuracy of the intrinsic parameters of the cameras
and knowledge of the sensing geometry. The former was provided, but the latter was not. Consequently,
state-of-the-art methods employing automatic detection of correspondences were utilized. The accuracy of these
methods depends greatly on both the sensor geometry and the content of the scene. For some of the scenes, these
were not adequate to yield acceptable performance and, as a result, manual correspondences were needed.
Details of the methodology are found in Section 8.1.

4.2  Boundary  Accurate  Tracker 

As part of pre-processing, the MIT design tracks all movers, storing the results in a database. Both the
location (within the sensor view) and the boundary of the object are computed. The tracker was partially
developed under the MSEE program and implements layered tracking, adaptive appearance models, and
occlusion reasoning.

Details  of  the  methodology  are  found  in  Section  8.2. 

4.3  Object  Classification 

As part of pre-processing, all movers and static objects are classified using a variant of Caffe adapted
to the MSEE object hierarchy. The implementation provided by ICSI (co-PI Darrell) did not fully implement
the hierarchy, but nevertheless provided reasonable performance on many objects of interest. One impact
on performance is that only the highest scoring class was maintained for each object. As such, classifier
errors had an undue impact on system performance. More robust performance would be
obtained if a full or partial distribution were maintained as part of the pre-processing. This is feasible, but
would complicate query processing owing to the increased combinatorial complexity. Consequently, the
simple approach was chosen for Phase 2.
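
The difference between keeping only the top class and keeping a distribution can be sketched in a few lines of Python. The class names, scores, and function names below are invented for illustration and are not taken from the MSEE classifier.

```python
def prob_top1(class_probs, target):
    """Top-1 approach: keep only the highest-scoring label, discard the rest."""
    best = max(class_probs, key=class_probs.get)
    return 1.0 if best == target else 0.0


def prob_distribution(class_probs, target):
    """Distribution approach: use the probability mass assigned to the target."""
    return class_probs.get(target, 0.0)


# Hypothetical classifier output for one object with two close candidates.
scores = {"car": 0.48, "truck": 0.45, "bus": 0.07}

top1_truck = prob_top1(scores, "truck")
dist_truck = prob_distribution(scores, "truck")
```

With the top-1 rule a near-tie is resolved irrevocably at pre-processing time, so a query about a truck receives no support from this object. Retaining k classes for each of n objects would leave up to k^n joint labelings for a query to consider, which is the combinatorial cost mentioned above.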

Details  of  the  methodology  are  found  in  Section  8.3. 

5  Discussion 

5.1  Scene-wide  3D  reasoning  requires  significant  prior  knowledge  of  sensor  placement. 

As described in the formal language specification, queries and associated predicates were defined as
reasoning over a scene rather than a sensor. That is, the collection of sensors provides observations of the
scene, but the scene itself may not be limited to the field of view of the sensors. Additionally, many
predicates (as defined) require extended spatial and temporal reasoning. For example, the predicate
“clear-line-of-sight” can potentially be used to reason over persons (or locations) that are not visible in the
same sensor. Furthermore, it is entirely possible that one could be interested in processing this particular
predicate in order to reason over individuals who may have at one time been visible in different sensors, but
at the time of the query one or both individuals may no longer be directly observed. This does not preclude
processing the query. As part of scene understanding, individuals are tracked and as such, even when not
directly observed, the system has some information as to their location. While the example is somewhat
extreme, it highlights the fact that, as defined, reasoning over the 3D geometry of the scene is unavoidable
unless one knows in advance that such queries will not be utilized. Many predicates implicitly require this
capability.

The importance of this discussion is that it underscores the critical need for knowledge of the sensing
geometry. In the absence of this information, it must be inferred. In many cases for the Phase 2 testing,
the information was not adequately provided. For example, camera locations were (roughly) provided,
but direction of viewing was not. Furthermore, state-of-the-art methods for finding correspondences also
proved to be inadequate for inferring the scene geometry to an acceptable quality for purposes of processing
queries. As a consequence, a manual and labor-intensive process was necessary in order to accommodate
the potential for these queries. The performers had no way of knowing ahead of time whether testing queries
would require this level of reasoning. It is the opinion of the PI that this complication was unnecessary
and did not serve the goals of the program.

5.2  Significant  tradeoffs  for  state-of-the-art  video-based  object  tracking. 

Many of the predicates, especially those involving gestures or actions, require some segmentation of the
body pose. Consequently, this project chose to implement a video tracking algorithm which produced accurate
object boundaries. While results were satisfactory, real-time performance is challenged by current computing
capabilities. As such, tracking speed was on the order of 10-20 seconds per frame. Some gains may be
achieved by better utilization of multi-core processors and/or GPU processing. However, in the current
framework, object tracking is performed off-line in order to focus on reasoning performance. There exist video
trackers which are capable of tracking objects in real time. However, these trackers do not produce
boundary-accurate results and, furthermore, do not perform well when the number of moving objects is greater
than ten.

This issue might be mitigated by running fast bounding-box trackers densely and boundary-accurate
trackers only when the query requires it. Implementation of such a scheme was entertained in the original
design, but it was felt that the added complexity would risk successful completion of a working system.

5.3  Rolling shutter effects significantly degrade moving camera analysis.

For  moving  camera  data,  correspondences  across  frames  were  both  dense  and  fairly  robust.  However,  rolling 
shutter  artifacts,  which  manifest  themselves  as  the  image  appearing  to  warp  from  frame-to-frame,  result 
in  state-of-the-art  structure-from-motion  algorithms  generating  severely  degraded  results.  While  one  could 
incorporate  rolling  shutter  into  the  model,  to  do  so  was  beyond  the  scope  of  this  project. 

6  Students 

The  following  is  a  list  of  students  that  have  been  supported  by  the  project  listed  by  institution. 

6.1  MIT 

•  Randi  Cabezas  -  PhD  student  (due  to  graduate  Summer  2016) 

•  Jason  Chang  -  Completed  PhD,  now  at  Google 

•  Zoran  Dzunic  -  PhD  student  (due  to  graduate  Fall  2015) 

•  Oren  Freifeld  -  Postdoc 

•  Dan Levine - Completed PhD, now at Jet Propulsion Laboratory

•  Dahua Lin - Completed PhD, now professor at CUHK

•  Guy  Rosman  -  Postdoc 

6.2  UCLA 

•  Avinash  Ravichandran  -  Completed  postdoc;  now  at  Amazon,  INC. 

•  Jonathan  Balzer  -  Completed  postdoc;  now  at  Vathos,  GmbH  (co-founder) 

•  Timothy  Brightbill  -  Completed  undergraduate  degree 

•  Joshua Hernandez - PhD student (due to graduate Summer 2015)

•  Vasiliy Karasev - PhD student (due to graduate Summer 2015)

•  Nikolaos Karianakis - PhD student

•  Sim-Lin Lau - Staff Research Associate

•  Stephen Phillips - Completed undergraduate degree

•  Siyang Tang - Completed MS degree; now at Apple, INC.

•  Brian Taylor - PhD student (due to graduate Fall 2015)

•  Chaohui Wang - Completed postdoc; now at Max Planck Institute

6.3  ETH 

•  Yuxin  Chen  -  PhD  student  (due  to  graduate  Summer  2016) 

6.4  ICSI 

•  Jiashi  Feng  -  Postdoc 

•  Eric  Tzeng  -  PhD  student 

•  Ross  Girshick  -  PhD  student 

7  Publications 

During the course of this project, the PI and co-PIs published 40 conference and journal papers on a variety
of relevant and diverse topics including Bayesian nonparametric models, system control, object recognition,
distributed sensing, Bayesian inference, and tracking. A full list of project-related publications is maintained
at the following URL:

http://projects.csail.mit.edu/csail-msee/pubs.html

The following is a list of publications funded (or partially funded) by this project that have either
appeared in the scientific literature or are pending review.
8  Supplementary Material

The  following  includes  presentation  material  referenced  in  the  main  report. 

8.1  3D  Scene  Modeling 

The methodology for constructing 3D information of the scene is described in the material provided in this
section. It was noted during the course of the program that the quality of the reconstruction, upon which
accurate spatial reasoning depends, is impacted by both the accuracy of the intrinsic parameters of the cameras
and knowledge of the sensing geometry. The former was provided, but the latter was not. Consequently,
state-of-the-art methods employing automatic detection of correspondences were utilized. The accuracy of these
methods depends greatly on both the sensor geometry and the content of the scene. For some of the scenes, these
were not adequate to yield acceptable performance and, as a result, manual correspondences were needed.
agenda

1. 3-d reconstruction pipeline

-  correspondence

-  local pose estimation

-  global refinement

-  gauge fixation

2. analysis


1. reconstruction pipeline
correspondence

standard approach:

-  interest point detection

-  SIFT descriptors

-  brute-force matching

-  homography $H_{jk} : \mathbb{P}^2 \to \mathbb{P}^2$

-  outlier rejection (RANSAC)

if that fails:

-  manual correspondence

-  DLT
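
The manual-correspondence fallback can be illustrated with a minimal direct linear transform in pure Python. This sketch is an assumption-laden simplification rather than the pipeline used in the project: it takes exactly four correspondences, fixes the last homography entry to 1 (the inhomogeneous form of DLT), and omits coordinate normalization and outlier rejection. All function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography_dlt(src, dst):
    """Estimate a 3x3 homography (last entry fixed to 1) from four point pairs."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per correspondence (x, y) -> (u, v).
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]


def apply_homography(h, point):
    """Map a 2D point through the homography h."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)
```

Degenerate inputs, such as three collinear points, make the linear system singular and raise a division error in this sketch. With more than four correspondences, a least-squares solve would replace the exact one.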


relative poses (local)

projection matrices $K_j, K_k \in \mathbb{R}^{3 \times 3}$
Euclidean homography $\hat{H}_{jk} = K_k^{-1} H_{jk} K_j$
four decompositions of the form

$\hat{H}_{jk} = R_{jk} + t_{jk} \otimes n^{\top}$

(twisted pair)

may need to pick solution manually
planar bundle adjustment

•  "world" plane

GPS coordinates

mutual distances

gauge fixation

•  global scale is just the "units"

•  but:

-  common to remote pairs

-  initialization of bundle adjustment

-  some predicates may depend on it

•  post-mortem scale adjustment

$\alpha^* = \arg\min_{\alpha \in \mathbb{R}} \sum_i \| \cdots - \alpha\, t_{\mathrm{GPS},i} \|_2^2$

result
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2.  analysis 
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correspondence 


•  co-visible  region 

•  texture 

•  shadows 

•  distortion 

18/24 


matching  performance 


Fall-out 
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priors 

•  planarity  assumption 

•  HUMINT 

-  co-visibility

-  correspondence 

•  GPS  data 

-  unreliable  elevation 
planarity assumption

moving cameras

Improved Scene Representation in MSEE Phase III

•  Uncertainty in 3D representation.
•  Multi-view tracking.
•  More complete scene understanding for solid objects.
•  More complete treatment of mobile cameras.


G.  Rosman  et  al  (MIT  CSAIL  SLI) 

Phase  3  2/25 

Uncertainty in 3D Reasoning

•  Question: How does uncertainty in 3D understanding affect predicate performance?
•  It affects reconstruction, tracking, camera positions, object positions, etc.
•  How to quantify it - in terms of algorithms, experiments, and ground truth data.
Uncertainty in 3D Reasoning

•  3D errors are mostly created between image processing of acquired footage, image correspondence, and 3D reconstruction phases.
•  3D uncertainty propagates to the predicates.

[Figure: pipeline diagram with stages Reconstruction, Scene Modeling, Tracking]
Uncertainty in 3D Reasoning

•  3D understanding in our implementation is encapsulated by 3D wrapper functions:
   -  Given a 2D point, fetch the 3D location.
   -  Can include uncertainty estimates.
Uncertainty in 3D Reasoning

Many predicates benefit from 3D reasoning:

•  Clear line of sight, Occluding
•  Below, On, Closer, Farther, Together
•  Running, Sitting, Standing, Stopping, Turning, Walking, Crawling, Stationary, Entering, Exiting

•  Some of them are relative, and some are absolute.
•  Tracking is key for most.
Uncertainty in 3D Reasoning

•  The desired multiview tracking system should handle objects tracked in 0, 1, or 2+ views.
•  It should lend itself to analysis, prediction, and resource allocation.
•  Some views are more informative for 3D location.
•  Some views may be informative due to other data (appearance).
•  Some view pairs are more informative.
Uncertainty in 3D Reasoning

Two examples of camera coverage - a good multiview tracker with uncertainty should cope with both!

[Figure: two camera-coverage examples; axis: time]
Uncertainty in 3D Reasoning

Geometric Uncertainty in Predicates Computation

Several sources of predicate errors are related to object locations - among others:

•  Segmentation errors
•  Tracking errors
•  3D camera reconstruction errors
•  3D object reconstruction errors
Geometric Uncertainty Sources in Predicates Computation

•  Segmentation errors - wrong object boundary.
•  Tracking errors - loss of tracking to background, switched tracks, tracks created from camera artifacts.
•  3D camera reconstruction errors - affect multiple objects.
•  3D object reconstruction errors
Geometric Uncertainty Sources in Predicates Computation

•  In many cases, 2D/image-based predicates approximate 3D-based ones.
•  They work better than 3D (as we tested) when we do not have a good 3D scene model, and they make some simplifying assumptions (i.e., implicit priors).
•  Modeling 3D uncertainty would allow us to get the best of both worlds, by accounting both for the error given a 3D representation and for the representation error.
•  Ample test data, where using all the viewpoints provides a stable 3D reconstruction ("ground truth"), would allow us to quantify that.
Reconstruction Error Sources

Mostly - errors introduced from features

•  Small-scale feature localization errors - these relate to noise/artifacts and inaccuracy in feature localization.
•  Correspondence errors - these relate to mismatches of feature points.
   -  Correspondences are usually sampled in order to find the MAP solution (RANSAC). Correspondence errors often lead to reconstruction catastrophes.
   -  Correspondence quality is a known problem in computer vision, with a strong effect on the results.
•  Image correspondence errors - less common. Avoiding these depends on a strongly connected scene graph with many overlaps. Sensor GPS/location data helps avoid some errors.
•  Reconstruction packages (such as VSFM) provide some support for dictating image correspondences.
Efficient Multiview Tracking in Complex Scenes

•  Several approaches are available for incorporating multiple views into tracking and classification.
•  In many cases, track loss can be minimized by combining hypotheses from multiple views.
•  This includes both geometric reasoning (2D-3D association) and photometric reasoning.
•  This holds regardless of the specific method for dealing with the complexity of the space (Pruning/MHT, Sampling, DP/MAP, ..).

Efficient Multiview Tracking in Complex Scenes

Incorporating 2D-3D association

•  Track then reconstruct
•  Reconstruct then track
•  General association, with a 3D representation that explains 2D observations.
   -  Note that generative models lend themselves to incorporating multi-sensor and multi-view data.
   -  For efficiency reasons, we may favor 2D tracking, followed by 3D association.
Scene Representation

Scene graph

•  nodes = locations/camera poses
•  collection of photometric attributes (features)
•  edges = overlapping views
Scene Representation

View graph

•  associated with a node of the scene graph
•  nodes = geometric/photometric attribute
•  edges connecting points belonging to the same surface
Scene Representation

Semantic representation

[Slide figure: excerpt from ECCV-14 submission ID 774, p. 11. Recoverable text follows.]

... input labels provided by hand-drawn bounding boxes for small subsets of the sequences (approx. one tenth of the frames for 'paper model', and approx. one third of the frames for 'castle'), as opposed to training off-the-shelf detectors ... the 'pyramid' label not being found. This is a failure case due to under-segmentation of the scene, ... cues that it is a separate object from the table top.

[Figure caption, partially legible:] ... on 'paper model' sequence. Our output semantic labeling (first) with input object segmentation (second), video segmentation results from [52] tuned for temporal consistency (third), results from [52] tuned for similar number of segments to our results (fourth), single image segmentation results from [55] tuned for similar number of segments.
Filtering for representation

Representation inference

•  "MAP" approach:
   -  Geometry reconstructed through one of many variants of bundle adjustment.
   -  No topology, no uncertainty estimate in the reconstruction.
   -  This was the approach adopted in first evaluation (see below).
•  Bayesian approach:
   -  Geometry and local photometry estimated as part of a filtering process.
   -  Allows incorporation of inertial sensing priors.
   -  Benefits from continuous camera trajectories (see below).
   -  Provides uncertainty estimates on pose as well as scene geometry.
Filtering for representation

[Figure: Corvis, Boelter Hall loop I]

[Figure: Corvis, Boelter Hall loop II]
Filtering for representation

3D scene understanding in Phase 3 - summary

•  Uncertainty quantification in a point-estimate setting.
•  Incorporating 3D reasoning into tracking.
•  Partitioning the scene into objects/primitives (e.g., groups of points and their connectivity).
•  Testing of the filtering approach, provided sequences are given with accurately synchronized video taken from a moving platform (e.g., quadrotor) with no rolling shutter artifacts.
8.2  Object Tracking

As part of pre-processing, the MIT design tracks all movers and stores the results in a database. Both the
location (within the sensor view) and the boundary of each object are computed. The tracker was partially
developed under the MSEE program and implements layered tracking, adaptive appearance models, and
occlusion reasoning.
Tracking: Why Do We Need Tracking?

Queries over time windows ⇒ need data association across frames.

Example: how many cars appear in the sequence above?

To report the right answer (one), we need to know it is the same car.


Chang  etal.  (MIT  CSAIL  SLI) 

Tracking 

Apr  24,  2014 

3/22 

The Tracker: A Layered Representation

•  $N + 1$ classes: background + $N$ objects.
•  Object $j$ is represented as a binary mask, denoted $M_j$.
•  Depth ordering: $Z$ is a permutation of $\{1, \ldots, N\}$. E.g., if $N = 4$ and $Z = (1, 3, 4, 2)$, then object 2 is the closest to the camera.
•  $L(x) \in \{0, 1, \ldots, N\}$: pixel label at location $x$. If $\max_{j \in \{1, \ldots, N\}} M_j(x) = 0$ then it is background: $L(x) = 0$. Otherwise,

   $L(x) = \arg\max_{\{j \,:\, M_j(x) = 1\}} Z(j)$
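The labeling rule of the layered representation can be sketched in a few lines. This is an illustrative reading, not the tracker's code: `label_image` is a hypothetical name, masks are plain nested lists, and $Z(j)$ is assumed to mean the position of object $j$ in the ordering tuple, so that the last-listed object is closest to the camera (which matches the slide's example $Z = (1, 3, 4, 2)$).

```python
from typing import List, Sequence

Mask = List[List[int]]


def label_image(masks: Sequence[Mask], ordering: Sequence[int]) -> List[List[int]]:
    """Compute L(x): 0 for background, else the front-most object covering x.

    masks[j - 1] is the binary mask M_j of object j (objects are 1-indexed).
    ordering lists the objects from farthest to closest (assumption).
    """
    rank = {obj: pos for pos, obj in enumerate(ordering, start=1)}  # Z(j)
    height, width = len(masks[0]), len(masks[0][0])
    labels = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            covering = [j for j, m in enumerate(masks, start=1) if m[y][x] == 1]
            if covering:
                # argmax of Z(j) over the objects whose mask covers this pixel
                labels[y][x] = max(covering, key=lambda j: rank[j])
    return labels


# Two objects on a 1x3 image; they overlap at the middle pixel.
m1 = [[1, 1, 0]]
m2 = [[0, 1, 1]]
labels_2_front = label_image([m1, m2], ordering=(1, 2))
labels_1_front = label_image([m1, m2], ordering=(2, 1))
```

Only the overlapping pixel changes label when the ordering is swapped, which is why the ordering matters only for objects that overlap.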
(video)


October  22,  2015 


40 


Approved  for  public  release;  distribution  unlimited. 


(MSEE)  Nonparametric  Representations  for  Integrated  Inference,  Control,  and  Sensing 


The Tracker: Probabilistic Modeling

•  Binary mask updates: $M_j^{t}(x)$ given by

   $\arg\max_{M_j^{t}(x) \in \{0, 1\}} \Pr\big(M_j^{t}(x) \mid I(x), Z, A, v, M_j^{t-1}(x)\big)$

   where the latent variables are the ordering $Z$, the appearance $A$, and the velocity $v$.
•  Appearance models: $p(I(x) \mid A^{(j)}, L(x) = j)$; parameters: $A = (A^{(1)}, \ldots, A^{(N)})$
•  $N$ velocities: $v = (v^{(1)}, \ldots, v^{(N)})$
•  Depth ordering: $Z$

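Appearance

•  The parameters: $A = (A^{(0)}, A^{(1)}, \ldots, A^{(N)})$
•  (A) A pixel-wise background model:

   $A^{(0)} = \{A^{(0)}(x)\}_x$, with $A^{(0)}(x) = (\mu^{(0)}(x), \Sigma^{(0)}(x))$ and $p(I(x) \mid A^{(0)}, L(x) = 0) \sim \mathcal{N}(\mu^{(0)}(x), \Sigma^{(0)}(x))$

•  (B) Each object has one GMM model:

   $A^{(j)} = \{(w_k^{(j)}, \mu_k^{(j)}, \Sigma_k^{(j)})\}_{k=1}^{K}$, with $p(I(x) \mid A^{(j)}, L(x) = j) \sim \sum_{k=1}^{K} w_k^{(j)}\, \mathcal{N}(\mu_k^{(j)}, \Sigma_k^{(j)})$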
Appearance: Initial Background Model

Temporal median: $m(x) = \operatorname{median}(I_{t=1}(x), I_{t=2}(x), \ldots)$

$\mu^{(0)}(x) = m(x)$

$\Sigma^{(0)}(x) \propto \sum_t (I_t(x) - m(x))^{\mathsf{T}} (I_t(x) - m(x))$
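A minimal sketch of this initialization, assuming grayscale frames stored as nested lists: the per-pixel temporal median is used as the background mean, and the spread about the median as the variance. The function name `init_background` and the $1/T$ normalization are assumptions made for illustration; the slide's normalization constant is not legible.

```python
from statistics import median
from typing import List, Sequence, Tuple

Frame = List[List[float]]


def init_background(frames: Sequence[Frame]) -> Tuple[Frame, Frame]:
    """Return (mean, variance) images for a pixel-wise Gaussian background.

    mean[y][x] is the temporal median m(x); variance[y][x] is the average
    squared deviation of the frames from that median (1/T is an assumption).
    """
    height, width = len(frames[0]), len(frames[0][0])
    mean = [[0.0] * width for _ in range(height)]
    var = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            values = [frame[y][x] for frame in frames]
            m = median(values)
            mean[y][x] = float(m)
            var[y][x] = sum((v - m) ** 2 for v in values) / len(values)
    return mean, var


# Three 1x2 frames; a bright mover crosses the first pixel in the last frame.
frames = [[[10.0, 50.0]], [[12.0, 50.0]], [[200.0, 50.0]]]
bg_mean, bg_var = init_background(frames)
```

The median keeps the transient mover out of the background mean, although the mover still inflates the variance at that pixel.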
Velocity

Object velocity implies a per-pixel prior

•  $v_{t-1}^{(j)}$: velocity of object $j$ between frame $t - 2$ and frame $t - 1$.
•  Applying $v_{t-1}^{(j)}$ to $M_j^{t-1}$ yields a new mask at frame $t$.
•  Distances from the new mask are used to (inversely) weight the pixels.
•  Pixels far from the new mask are unlikely to be classified as object $j$ at frame $t$.
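The velocity prior can be illustrated as follows. This is a sketch under assumptions, not the tracker's implementation: velocities are integer pixel shifts, the distance is Euclidean to the nearest predicted-mask pixel (brute force), and the inverse weighting is taken to be `exp(-distance / scale)`; the slide states only that distances weight pixels inversely. The names `shift_mask`, `velocity_prior`, and `scale` are hypothetical.

```python
import math
from typing import List, Tuple

Mask = List[List[int]]


def shift_mask(mask: Mask, velocity: Tuple[int, int]) -> Mask:
    """Translate a binary mask by (dy, dx); pixels leaving the image are dropped."""
    dy, dx = velocity
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if mask[y][x] == 1 and 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = 1
    return out


def velocity_prior(mask: Mask, velocity: Tuple[int, int], scale: float = 1.0) -> List[List[float]]:
    """Per-pixel weight that decays with distance from the predicted mask."""
    predicted = shift_mask(mask, velocity)
    on = [(y, x) for y, row in enumerate(predicted) for x, v in enumerate(row) if v]
    h, w = len(mask), len(mask[0])
    if not on:
        # The predicted mask left the image: fall back to a flat prior.
        return [[1.0] * w for _ in range(h)]
    prior = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d = min(math.hypot(y - py, x - px) for py, px in on)
            prior[y][x] = math.exp(-d / scale)
    return prior


# A single-pixel object moving one pixel to the right on a 1x4 image.
prior = velocity_prior([[1, 0, 0, 0]], velocity=(0, 1))
```

For real image sizes a distance transform would replace the brute-force nearest-pixel search.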
Ordering: $\theta_{\mathrm{ord}}$

•  Explicit modeling of the ordering helps to deal with occlusions.
•  $Z$ = a permutation of $\{1, \ldots, N\}$
•  $p(I \mid A, Z, M_1, \ldots, M_N)$, where $I$ is the color image, $A$ the appearance, $Z$ the ordering, and $M_1, \ldots, M_N$ the object masks.

   Propose $Z'$; if $p(I \mid A, Z', \{M_j\}_{j=1}^{N}) > p(I \mid A, Z, \{M_j\}_{j=1}^{N})$ then $Z \leftarrow Z'$.
•  There are $N!$ options, but we only need to consider a subset of these: if objects don't overlap, their depth ordering doesn't matter.
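The propose-and-accept rule for the ordering can be sketched as a greedy search that only swaps objects known to overlap. Everything here is illustrative: `refine_ordering` is a hypothetical name, proposals are restricted to pairwise swaps (the slides do not say how proposals are generated), and the likelihood is passed in as a caller-supplied function standing in for $p(I \mid A, Z, \{M_j\})$.

```python
from typing import Callable, List, Sequence, Tuple


def refine_ordering(
    ordering: Sequence[int],
    overlapping_pairs: Sequence[Tuple[int, int]],
    log_likelihood: Callable[[List[int]], float],
) -> List[int]:
    """Greedily accept swaps of overlapping objects that raise the likelihood."""
    current = list(ordering)
    best = log_likelihood(current)
    improved = True
    while improved:
        improved = False
        for a, b in overlapping_pairs:
            proposal = list(current)
            ia, ib = proposal.index(a), proposal.index(b)
            proposal[ia], proposal[ib] = proposal[ib], proposal[ia]
            score = log_likelihood(proposal)
            if score > best:  # accept only strict improvements, so the loop ends
                current, best, improved = proposal, score, True
    return current


def toy_log_likelihood(order: List[int]) -> float:
    """Made-up score: the image is explained better when object 1 is in front of 2."""
    return 0.0 if order.index(1) > order.index(2) else -5.0


# Objects 1 and 2 overlap; object 3 overlaps nothing, so it is never proposed.
result = refine_ordering((1, 2, 3), [(1, 2)], toy_log_likelihood)
```

Restricting proposals to overlapping pairs reflects the slide's remark that the ordering of non-overlapping objects does not matter, which keeps the search far below $N!$ candidates.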

Parameter Updates

•  $L^{t}$ is determined by the binary masks and the ordering.
•  Given $I^{t}$ and $L^{t}$, we can estimate a new $A$. Then use a convex combination with previous estimates, e.g., $A = \alpha \times A_{\mathrm{old}} + (1 - \alpha) \times A_{\mathrm{new}}$, where $\alpha = 0.1$.
•  Given $M_j^{t-1}$ and $M_j^{t}$, we can estimate a new velocity.
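The two updates above can be sketched as follows. The blend follows the slide's convex combination with $\alpha = 0.1$ on the old estimate; estimating velocity as the displacement of the mask centroid is an assumption made for illustration, since the slides do not state how velocity is computed from consecutive masks. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]


def blend(old: Sequence[float], new: Sequence[float], alpha: float = 0.1) -> List[float]:
    """Convex combination alpha * old + (1 - alpha) * new, element by element."""
    return [alpha * o + (1.0 - alpha) * n for o, n in zip(old, new)]


def centroid(mask: Mask) -> Tuple[float, float]:
    """Mean (row, column) of the pixels set in a binary mask."""
    on = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not on:
        raise ValueError("empty mask has no centroid")
    return (sum(y for y, _ in on) / len(on), sum(x for _, x in on) / len(on))


def estimate_velocity(prev_mask: Mask, curr_mask: Mask) -> Tuple[float, float]:
    """Velocity as centroid displacement between two frames (assumption)."""
    (py, px), (cy, cx) = centroid(prev_mask), centroid(curr_mask)
    return (cy - py, cx - px)


updated = blend([100.0, 0.0], [110.0, 10.0])
velocity = estimate_velocity([[1, 1, 0, 0]], [[0, 0, 1, 1]])
```

With $\alpha = 0.1$ as written on the slide, the new estimate carries most of the weight, so the appearance model adapts quickly.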
Changing the Number of Objects

Use a simple heuristic to establish $N$:

•  An object can "die" if it doesn't have enough image evidence.
•  For creating new objects, we consider, among the pixels labeled as background, the connected components of low-likelihood pixels.
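The object-creation heuristic can be sketched with a breadth-first flood fill. This is an illustration only: the function name, the 4-connectivity, the `threshold` on the background log-likelihood, and the `min_size` filter are all assumptions not stated on the slide.

```python
from collections import deque
from typing import List, Tuple

Pixel = Tuple[int, int]


def propose_new_objects(
    labels: List[List[int]],
    bg_log_lik: List[List[float]],
    threshold: float,
    min_size: int = 1,
) -> List[List[Pixel]]:
    """Connected components of background pixels poorly explained by the background."""
    h, w = len(labels), len(labels[0])
    candidate = [
        [labels[y][x] == 0 and bg_log_lik[y][x] < threshold for x in range(w)]
        for y in range(h)
    ]
    seen = [[False] * w for _ in range(h)]
    components: List[List[Pixel]] = []
    for y in range(h):
        for x in range(w):
            if not candidate[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            component: List[Pixel] = []
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and candidate[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) >= min_size:
                components.append(component)
    return components


# The right column already belongs to object 1, so it is never a candidate.
labels = [[0, 0, 0, 1], [0, 0, 0, 1]]
bg_log_lik = [[-9.0, -9.0, -1.0, -9.0], [-1.0, -1.0, -1.0, -9.0]]
new_objects = propose_new_objects(labels, bg_log_lik, threshold=-5.0, min_size=2)
```

Each returned component could seed the binary mask of a newly created object.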
8.3  Object Classification

As part of pre-processing in the MIT design, all movers and static objects are classified using a variant of Caffe adapted
to the MSEE object hierarchy. The implementation provided by ICSI (co-PI Darrell) did not fully implement
the hierarchy, but nevertheless provided reasonable performance on many objects of interest. One limitation
is that only the highest-scoring class was maintained for each object. As such, classifier errors
had an undue impact on system performance. More robust performance would be
obtained if a full or partial distribution over classes were maintained as part of the pre-processing. This is feasible, but
would complicate query processing owing to the increased combinatorial complexity. Consequently, the
simple approach was chosen for phase 2.
Traditional Vision Models...

•  SIFT-VQ-BOW: feature extraction, encoding, classification
•  Scanning Window HOG

Convolve-Quantize-Pool -> [Convolve-Quantize-Pool]

...now, CNN ILSVRC Architecture:

[Figure: network diagram with convolutional layers, max pooling, and fully-connected (dense) layers]

Convolve-Quantize-Pool -> [Convolve-Quantize-Pool] -> [[Convolve-Quantize-Pool]] -> ...

Fukushima's Neocognitron 1974-82; LeCun's LeNet, 1989;

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional
Neural Networks. In Proc. NIPS, 2012.
"Regions with CNN features" (R-CNN)

1.  Input image
2.  Extract region proposals (~2k) (e.g., "selective search")
3.  Compute CNN features (on each warped region)
4.  Classify regions (aeroplane? no. person? yes. tvmonitor? no.)

(With a few minor tweaks: semantic segmentation)

Object detection

Desired output (categories shown: airplane, bird, motorbike, person, sofa)


Compare to ground truth

[Figure: 'person' detector predictions and ground truth 'person' boxes]

Sort by confidence

•  true positive (high overlap)
•  false positive (low overlap or duplicate)
Evaluation metric

Average Precision (AP): 0% is worst, 100% is best

mean AP over classes (mAP)

[Figure: precision-recall curve; x-axis: recall]

PASCAL VOC Challenge

Dataset: 22k images, 50k objects, 20 classes

Detect: people, horses, sofas, bicycles, pottedplants, ...
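The evaluation procedure described on these slides (sort detections by confidence, count a detection as a true positive when it overlaps an unmatched ground-truth box enough, and treat duplicates as false positives) can be sketched as below. The 0.5 overlap threshold and the non-interpolated form of average precision are assumptions of this sketch; the official PASCAL VOC evaluation uses its own interpolation, so numbers from this code are not directly comparable to the tables in this section. Function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    detections: Sequence[Tuple[float, Box]],
    ground_truth: Sequence[Box],
    iou_threshold: float = 0.5,
) -> float:
    """Non-interpolated AP for one class on one image set."""
    matched = [False] * len(ground_truth)
    hits = 0
    precisions: List[float] = []
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    for rank, (_, box) in enumerate(ranked, start=1):
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(ground_truth):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_iou >= iou_threshold and not matched[best_idx]:
            matched[best_idx] = True  # true positive
            hits += 1
            precisions.append(hits / rank)
        # otherwise: false positive (low overlap or duplicate)
    return sum(precisions) / len(ground_truth) if ground_truth else 0.0


gt_boxes = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
dets = [
    (0.9, (0.0, 0.0, 10.0, 10.0)),    # true positive
    (0.8, (1.0, 1.0, 11.0, 11.0)),    # duplicate of the first box: false positive
    (0.7, (20.0, 20.0, 30.0, 30.0)),  # true positive
]
ap = average_precision(dets, gt_boxes)
```

Averaging this quantity over classes gives the mAP figure reported on the slides.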
Progress on PASCAL VOC

ImageNet LSVR Challenge

•  1000 classes (vs. 20)
•  1.2 million training images (vs. 10k)
•  Image classification (not detection), e.g., "bus anywhere?"

[Deng et al. CVPR'09]
Objective

Can the SuperVision CNN detect objects?

Proposed system

R-CNN: "Regions with CNN features"

1.  Input image
2.  Extract region proposals (~2k), using "selective search" [van de Sande et al. 2011]
3.  Compute CNN features
4.  Classify regions (e.g., tvmonitor? no.)

[Girshick, Donahue, Darrell, Malik, to appear in CVPR'14]
R-CNN results on PASCAL

Method                                    | VOC 2007 | VOC 2010
DPM v5 (Girshick et al. 2011)             | 33.7%    | 29.6%
UVA sel. search (Uijlings et al. 2012)    | -        | 35.1%
Regionlets (Wang et al. 2013)             | 41.7%    | 39.7%
R-CNN                                     | 54.2%    | 50.2%
R-CNN + bbox regression                   | 58.5%    | 53.7%

metric: mean average precision (higher is better)
ImageNet detection (ILSVRC2013)

R-CNN and OverFeat

OverFeat [Sermanet et al. 2014]

-  developed using ILSVRC2013
-  tested on ILSVRC2013: s-o-t-a
-  no results on PASCAL VOC

R-CNN [Girshick et al. 2014]

-  developed using PASCAL VOC
-  tested on PASCAL VOC: s-o-t-a
-  no results on ILSVRC2013

No apples-to-apples comparison
R-CNN detector training

What did the network learn?

MSEE Phase 2 RCNN Component

•  Leverage ImageNet-derived representation (from ImageNet-1K)
•  Use all ImageNet classes to train new class on top of R-CNN model.
•  Find nearest class in ImageNet to MSEE ontology.
•  Fixed a priori mapping for P2
•  Significant limitations: no Person subclasses

MSEE Phase 3 RCNN Plans

•  Exploit adaptation (2 NIPS 2014 papers in review)
•  Take in-domain examples as well as ImageNet training data
•  Add new data for explicit person and vehicle subclasses
•  Fast training on the fly
•  Tree-based loss for reasoning within hierarchy
4.  Detection as Adaptation: Generalizing to new categories...

•  NIPS 2014, in review.
•  (To be released on arXiv, ca. July 2014)

Detection as Adaptation

[Figure panels: (b) Hidden Layer Adaptation; (c) Output Layer Adaptation]

Results

[Figure: "ILSVRC2013 mAP Relative to Oracle"; bars: Oracle: Full Detection Net, (Ours) DNN, Classification Net; series: All 200 Categories, 100 Held-out Categories; x-axis: mAP relative to oracle full detection net]

[Figure: "ILSVRC2013 Detection mAP"; x-axis: mean average precision (mAP) %]
Ablation results

Detection Adaptation Layers        | Output Layer Adaptation | mAP Trained (100 Categories) | mAP Held-out (100 Categories) | mAP All (200 Categories)
No Adapt (Classification Network)  | -                       | 12.63                        | 10.31                         | 11.90
fc_bgrnd                           | -                       | [illegible]                  | 12.22                         | 13.60
fc_bgrnd, fc_6                     | -                       | [illegible]                  | 13.72                         | 19.20
fc_bgrnd, fc_7                     | -                       | [illegible]                  | 14.57                         | 19.00
fc_bgrnd, fc_B                     | -                       | [illegible]                  | 11.74                         | 14.90
fc_bgrnd, fc_6, fc_7               | -                       | 25.78                        | 14.20                         | 20.00
fc_bgrnd, fc_6, fc_7, fc_B         | -                       | 26.33                        | 14.42                         | 20.40
fc_bgrnd, layers 1-7, fc_B         | -                       | 27.81                        | 15.85                         | 21.83
fc_bgrnd, layers 1-7, fc_B         | Avg NN (k=5)            | 28.12                        | 15.97                         | 22.05
fc_bgrnd, layers 1-7, fc_B         | Avg NN (k=100)          | 27.91                        | 15.96                         | 21.94
Oracle: Full Detection Network     |                         | 29.72                        | 26.25                         | 28.00

Table 1: Ablation study for the pieces of DNN. We consider removing different pieces of our algorithm
to determine which pieces are essential. We consider training with the first 100 (alphabetically)
categories of the ILSVRC2013 detection validation set (on val1) and report mean average precision
(mAP) over the 100 trained-on and 100 held-out categories (on val2). We find the best improvement
is from fine-tuning all convolutional and fully connected layers and using output layer adaptation.

Near  misses  of  adapted  models 
Localization is improved

[Figure: held-out categories; axis: total false positives]
Detection Summary (RCNN vs DPM)

•  ~150% improvement in raw performance when training from ImageNet alone
•  ~50% improvement in raw performance when training from 1-3 in-domain examples
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